library(DBI)
library(RPostgres)

db <- dbConnect(
  Postgres(),
  dbname = "dbintro",
  host = "localhost", # Make sure to specify the host
  port = 5432L,
  user = "postgres",
  password = "pgpasswd"
)
Let’s fill this database with the nycflights13 flights data:
Inside the Docker container (that is, on the server running via Docker), you can list tables with:
\dt
You can run SQL operations in here without R:
SELECT * FROM airports;
Note that commands are only executed once psql encounters a ;!
Exit a long print with q
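To get such an interactive psql session in the first place, you can execute psql inside the running container. A sketch; the container name postgres_db is an assumption, so adjust it to your setup:

```shell
# Open an interactive psql session inside the container
# (container name "postgres_db" is an assumption -- check `docker ps`;
# user and database match the dbConnect() call above)
docker exec -it postgres_db psql -U postgres -d dbintro
```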
Working with PostgreSQL: pretty similar to SQLite…
But not the same…
Error : Failed to fetch row : ERROR: table "df1" does not exist
Error : Failed to fetch row : ERROR: table "df2" does not exist
dbExecute(db, "
  CREATE TABLE df1 (
    id SERIAL PRIMARY KEY,
    capital_letters VARCHAR(1) CHECK (capital_letters ~ '^[A-Z]$'),
    my_date DATE CHECK (my_date > '2000-01-01')
  )")
[1] 0
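The CHECK constraints now guard the table: PostgreSQL rejects inserts that violate them with a check-constraint error. A quick sketch (assumes the db connection from above; the values are made up):

```r
# Violates the capital_letters CHECK ('a' is not in [A-Z]):
try(dbExecute(db, "INSERT INTO df1 (capital_letters, my_date) VALUES ('a', '2024-05-01')"))

# Violates the my_date CHECK (the date is not after 2000-01-01):
try(dbExecute(db, "INSERT INTO df1 (capital_letters, my_date) VALUES ('A', '1999-12-31')"))
```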
Working with PostgreSQL and dbplyr: the same as SQLite
I copied this from the last session, and it works just as well:
tbl(db, "flights") |>
  inner_join(tbl(db, "planes"), by = "tailnum", suffix = c("", "_plane")) |>
  mutate(plane_age = year - year_plane) |>
  select(arr_delay, plane_age) |>
  filter(!is.na(arr_delay), !is.na(plane_age)) |>
  collect() |>
  group_by(plane_age) |>
  summarise(avg_delay = mean(arr_delay)) |>
  ggplot(aes(x = plane_age, y = avg_delay)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)
User-based authentication
Imagine this:
you have a team of researchers
some are responsible for data gathering (annotation, webscraping, conducting survey waves, etc.)
some are responsible for data analysis
you are responsible for data management and public reporting
You want to make sure that:
researchers in team gathering do not overwrite each other's changes
team analysis always has the newest data
the summary data on the website is online as soon as possible
with PostgreSQL, you can make sure the two groups don't disrupt each other and that the most recent results are pulled from the database automatically
User-based authentication: users
We create three new users:
one for the researcher in the gathering group
one for the analyser in the analysis group
one called “reader” which represents the general public
dbExecute(db, "CREATE USER gatherer WITH ENCRYPTED PASSWORD 'supersecret'")
[1] 0
dbExecute(db, "CREATE USER analyser WITH ENCRYPTED PASSWORD 'supersecret'")
[1] 0
dbExecute(db, "CREATE USER reader WITH ENCRYPTED PASSWORD 'supersecret'")
[1] 0
We can look at users:
dbGetQuery(db, "SELECT usename FROM pg_user")
usename
1 postgres
2 gatherer
3 analyser
4 reader
User-based authentication: users
Let’s log in as analyser:
db_analyser <- dbConnect(
  Postgres(),
  dbname = "dbintro",
  host = "localhost", # Make sure to specify the host
  port = 5432L,
  user = "analyser",
  password = "supersecret"
)
User-based authentication: roles
So far, neither of the new users can do anything:
tbl(db_analyser, "flights")
Error in `db_query_fields.DBIConnection()`:
! Can't query fields.
ℹ Using SQL: SELECT * FROM "flights" AS "q01" WHERE (0 = 1)
Caused by error:
! Failed to fetch row : ERROR: permission denied for table flights
They have to be assigned roles first
User-based authentication: roles
We first give the “gatherer” user permission to update a table:
dbExecute(db, "GRANT UPDATE, INSERT ON flights TO gatherer")
[1] 0
Since we do not fully trust the analyser, we give them read-only access to flights, but full access to df1, which we pretend stores the results of the analysis:
dbExecute(db, "GRANT SELECT ON flights TO analyser")
[1] 0
dbExecute(db, "GRANT ALL PRIVILEGES ON df1 TO analyser")
[1] 0
Finally, the public gets SELECT access to only some columns in the results table:
dbExecute(db, "GRANT SELECT (capital_letters) ON df1 TO reader")
[1] 0
User-based authentication: roles (analyser)
Still logged in as the analyser, let’s try to access the data again:
dbExecute(db_analyser, "INSERT INTO flights (year) VALUES (2013)")
Error: Failed to fetch row : ERROR: permission denied for table flights
We can add data to the tables that we were granted access to though:
dbExecute(db_analyser, "INSERT INTO df1 VALUES (1, 'A', '2025-07-11')")
[1] 1
User-based authentication: roles (gatherer)
Logging in as the gatherer, we can add new cases:
db_gatherer <- dbConnect(
  Postgres(),
  dbname = "dbintro",
  host = "localhost", # Make sure to specify the host
  port = 5432L,
  user = "gatherer",
  password = "supersecret"
)
dbExecute(db_gatherer, "INSERT INTO flights (year) VALUES (2013)")
[1] 1
User-based authentication: roles (reader)
Logging in as the reader, we can't change anything and can only read specific columns:
db_reader <- dbConnect(
  Postgres(),
  dbname = "dbintro",
  host = "localhost", # Make sure to specify the host
  port = 5432L,
  user = "reader",
  password = "supersecret"
)
dbGetQuery(db_reader, "SELECT * FROM df1")
Error: Failed to fetch row : ERROR: permission denied for table df1
dbGetQuery(db_reader, "SELECT capital_letters FROM df1")
capital_letters
1 A
Now we could give these credentials out to the public without having to worry that they change or read anything they are not supposed to.
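If a collaborator leaves the project, their access can be withdrawn again. A sketch using the admin connection (REVOKE and DROP USER are standard PostgreSQL; a user's privileges must be removed before the user can be dropped):

```r
# Withdraw all privileges the analyser holds ...
dbExecute(db, "REVOKE ALL PRIVILEGES ON df1 FROM analyser")
dbExecute(db, "REVOKE SELECT ON flights FROM analyser")
# ... then remove the user entirely
dbExecute(db, "DROP USER analyser")
```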
Exercises 1
Using the PostgreSQL database or the SQLite database from session 4:
Right join results_state and facts using dbplyr instead of dbGetQuery
Recreate the table results_state_time by querying and joining from db (using dbplyr instead of dbGetQuery).
Recreate results_state_facts using dbplyr. Don’t forget to add total_votes and pct_votes
Extract the SQL query from your code in 3. and run it with dbGetQuery
Working with text data in AmCAT
Why AmCAT
Optimized to store, annotate, preprocess, search, share and present text data collections
docker-compose -f 05_Scaling_Reporting_and_Database_Software/data/docker-compose-amcat.yml up -d
Connecting from the terminal
Creating a test index:
docker exec -it amcat4 amcat4 create-test-index
Configure the AmCAT server:
docker exec -it amcat4 amcat4 config
#> Reading/writing settings from .env
#>
#> host: Host this instance is served at (needed for checking tokens)
#> The current value for host is http://localhost/amcat.
#> Enter a new value, press [enter] to leave unchanged, or press
#> [control+c] to abort:
#>
#> elastic_password: Elasticsearch password. This the password for the
#> 'elastic' user when Elastic xpack security is enabled
#> The current value for elastic_password is None.
#> Enter a new value, press [enter] to leave unchanged, or press
#> [control+c] to abort:
#>
#> elastic_host: Elasticsearch host. Default: https://localhost:9200 if
#> elastic_password is set, http://localhost:9200 otherwise
#> The current value for elastic_host is http://elastic8:9200.
#> Enter a new value, press [enter] to leave unchanged, or press
#> [control+c] to abort:
#>
#> elastic_verify_ssl: Elasticsearch verify SSL (only used if
#> elastic_password is set). Default: True unless host is localhost)
#> The current value for elastic_verify_ssl is True.
#> Enter a new value, press [enter] to leave unchanged, or press
#> [control+c] to abort:
#>
#> system_index: Elasticsearch index to store authorization information
#> in
#> The current value for system_index is amcat4_system.
#> Enter a new value, press [enter] to leave unchanged, or press
#> [control+c] to abort:
#>
#> auth: Do we require authorization?
#> Possible choices:
#> - no_auth: everyone (that can reach the server) can do anything
#> they want
#> - allow_guests: everyone can use the server, dependent on
#> index-level guest_role authorization settings
#> - allow_authenticated_guests: everyone can use the server, if they
#> have a valid middlecat login,
#> and dependent on index-level guest_role authorization settings
#> - authorized_users_only: only people with a valid middlecat login
#> and an explicit server role can use the server
#>
#> The current value for auth is AuthOptions.no_auth.
#> Enter a new value, press [enter] to leave unchanged, or press
#> [control+c] to abort:
#>
#> middlecat_url: Middlecat server to trust as ID provider
#> The current value for middlecat_url is https://middlecat.net.
#> Enter a new value, press [enter] to leave unchanged, or press
#> [control+c] to abort:
#>
#> admin_email: Email address for a hardcoded admin email (useful for
#> setup and recovery)
#> The current value for admin_email is None.
#> Enter a new value, press [enter] to leave unchanged, or press
#> [control+c] to abort:
#>
#> minio_host: None
#> The current value for minio_host is None.
#> Enter a new value, press [enter] to leave unchanged, or press
#> [control+c] to abort:
#>
#> minio_tls: None
#> The current value for minio_tls is False.
#> Enter a new value, press [enter] to leave unchanged, or press
#> [control+c] to abort:
#>
#> minio_access_key: None
#> The current value for minio_access_key is None.
#> Enter a new value, press [enter] to leave unchanged, or press
#> [control+c] to abort:
#>
#> minio_secret_key: None
#> The current value for minio_secret_key is None.
#> Enter a new value, press [enter] to leave unchanged, or press
#> [control+c] to abort:
#> *** Written .env file to .env ***
Connecting from R
We need to log in, which you can do without a user by default:
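A sketch of the login using the amcat4r package (the server address follows the host setting from the config above; since auth is set to no_auth, no credentials are required; the test index name is an assumption):

```r
library(amcat4r)

# Connect to the local AmCAT server (no user needed with no_auth)
amcat_login("http://localhost/amcat")

# Retrieve documents from the test index created above
# (the index name is an assumption -- check list_indexes())
query_documents(index = "state_of_the_union", queries = NULL)
```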
# A tibble: 200 × 3
.id title date
<id_col> <chr> <dttm>
1 9Vs…TGr 1790: George Washington 1790-01-08 00:00:00
2 9ls…TGr 1790: George Washington 1790-12-08 00:00:00
3 91s…TGr 1791: George Washington 1791-10-25 00:00:00
4 -Fs…TGr 1792: George Washington 1792-11-06 00:00:00
5 -Vs…TGr 1793: George Washington 1793-12-03 00:00:00
6 -ls…TGr 1794: George Washington 1794-11-19 00:00:00
7 -1s…TGr 1795: George Washington 1795-12-08 00:00:00
8 _Fs…TGr 1796: George Washington 1796-12-07 00:00:00
9 _Vs…TGr 1797: John Adams 1797-11-22 00:00:00
10 _ls…TGr 1798: John Adams 1798-12-08 00:00:00
# ℹ 190 more rows
Dataset
Load data from Rauh and Schwalbach (2020) into AmCAT:
if (!"houseofcommons" %in% list_indexes()$id) {
  corp_hoc_df <- readRDS("../03_Working_with_Files/data/Corp_HouseOfCommons_V2.rds") |>
    mutate(date = as.Date(date)) |>
    rename(title = agenda) |>
    filter(!is.na(date)) |>
    replace_na(list(title = "", text = "")) |>
    select(-party.facts.id)

  # speed this up by only using a sample
  # corp_hoc_df <- corp_hoc_df |>
  #   sample_n(size = 500)

  # define types of fields
  fields <- list(
    date = "date",
    text = "text",
    title = "text",
    speechnumber = "integer",
    speaker = "keyword",
    party = "keyword",
    chair = "boolean",
    terms = "integer",
    parliament = "keyword",
    iso3country = "keyword"
  )

  # create the index
  create_index(
    index = "houseofcommons",
    name = "House Of Commons",
    description = "HouseOfCommons",
    create_fields = fields
  )

  # upload the data
  upload_documents(
    "houseofcommons",
    documents = corp_hoc_df,
    chunk_size = 1000,
    verbose = TRUE
  )
}
# A tibble: 200 × 3
.id date title
<id_col> <dttm> <chr>
1 TT0…mBx 1988-12-01 00:00:00 Engagements [Oral Answers To Questions > Prime …
2 fD0…mBy 1988-12-01 00:00:00 Patrick Ryan
3 jz0…mBy 1988-12-01 00:00:00 Patrick Ryan
4 pD0…mBy 1988-12-01 00:00:00 Business Of The House
5 tj0…mBy 1988-12-01 00:00:00 Business Of The House
6 0j0…mBy 1988-12-01 00:00:00 Business Of The House
7 _z0…mBz 1988-12-01 00:00:00 Fisheries
8 BD0…mFz 1988-12-01 00:00:00 Fisheries
9 Bz0…mFz 1988-12-01 00:00:00 Fisheries
10 Cz0…mFz 1988-12-01 00:00:00 Fisheries
# ℹ 190 more rows
query_documents(index ="houseofcommons", queries ='"European Union" OR EU')
# A tibble: 200 × 3
.id date title
<id_col> <dttm> <chr>
1 Oj0…XJr 1988-12-19 00:00:00 Adjournment (Christmas)
2 8z0…JZJ 1989-02-07 00:00:00 Nato Secretary-General [Oral Answers To Questio…
3 4T0…VqA 1988-11-25 00:00:00 Foreign Affairs And Defence [Orders Of The Day …
4 _j0…VqA 1988-11-25 00:00:00 Foreign Affairs And Defence [Orders Of The Day …
5 Aj0…VuB 1988-11-25 00:00:00 Foreign Affairs And Defence [Orders Of The Day …
6 Kz0…VuB 1988-11-25 00:00:00 Debate On The Address [Orders Of The Day > Orde…
7 Fj0…1pw 1988-11-24 00:00:00 Business Of The House [Prayers > Bills Presente…
8 0T0…MJ1 1989-03-21 00:00:00 Advisory Committee For Wales [Orders Of The Day…
9 ez0…fMs 1989-05-18 00:00:00 Developments In The European Community
10 hT0…fMs 1989-05-18 00:00:00 Developments In The European Community
# ℹ 190 more rows
# A tibble: 2 × 6
expression min median `itr/sec` mem_alloc `gc/sec`
<bch:expr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 simple 1 1 1.12 NaN NaN
2 complex 1.08 1.10 1 NaN NaN
# A tibble: 2 × 6
expression min median `itr/sec` mem_alloc `gc/sec`
<bch:expr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 in_memory 1 1.09 1 NaN NaN
2 in_db 1 1 1.15 NaN NaN
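The relative timings above can be produced with the bench package. A sketch, assuming flights and weather come from nycflights13 and are also stored in the database (check = FALSE because the two expressions return different objects; relative = TRUE scales all columns to the fastest run):

```r
library(bench)
library(dplyr)

bench::mark(
  # join the data frames held in R's memory
  in_memory = flights |>
    inner_join(weather, by = "time_hour") |>
    nrow(),
  # let PostgreSQL do the join, then pull the result into R
  in_db = tbl(db, "flights") |>
    inner_join(tbl(db, "weather"), by = "time_hour") |>
    collect() |>
    nrow(),
  check = FALSE,
  relative = TRUE
)
```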
Fix bottlenecks
Without an index, the search in the database takes longer. You can also create an index automatically with copy_to().
dbWriteTable(db, "flights_w_index", flights, overwrite = TRUE)
dbWriteTable(db, "weather_w_index", weather, overwrite = TRUE)
dbExecute(db, "CREATE INDEX ON flights_w_index (time_hour)")
[1] 0
dbExecute(db, "CREATE INDEX ON weather_w_index (time_hour)")
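Alternatively, copy_to() can write the table and create the index in one step; a sketch (indexes and temporary are arguments of the dbplyr copy_to() method; temporary = FALSE keeps the table beyond the session):

```r
library(dplyr)

# Copy the data frame into the database and index time_hour in one call
copy_to(db, weather, "weather_w_index",
        overwrite = TRUE, temporary = FALSE,
        indexes = list("time_hour"))
```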
# A tibble: 3 × 6
expression min median `itr/sec` mem_alloc `gc/sec`
<bch:expr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 in_memory 1 1 1 NaN NaN
2 in_db 1.05 1.08 1.20 NaN NaN
3 in_db_w_index 1.23 1.32 1.56 NaN NaN
Data Management: Summary
What for?
enables new research questions by being able to combine and clean new data
lets you sleep easy at night
makes your work transparent to others and yourself
thinking about reproducibility from the start
When to use files vs. databases
often not clear from the start
working with files is easier as a solo researcher (who keeps a clean project directory)
when data changes regularly, it makes sense not to add to the same file again and again
when several researchers collaborate, a database solves many issues
when you have text data, vectors, or complex elements, NoSQL databases like AmCAT or MongoDB can make sense
When done, dbDisconnect
Whenever you are done working with a database, you should disconnect from it:
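For the connections opened in this session, that would be:

```r
# Close every open connection from this session
dbDisconnect(db)
dbDisconnect(db_analyser)
dbDisconnect(db_gatherer)
dbDisconnect(db_reader)
```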
Rauh, Christian, and Jan Schwalbach. 2020. “The ParlSpeech V2 data set: Full-text corpora of 6.3 million parliamentary speeches in the key legislative chambers of nine representative democracies.” Harvard Dataverse. https://doi.org/10.7910/DVN/L4OAKN.
Weidmann, Nils B. 2023. Data Management for Social Scientists: From Files to Databases. 1st ed. Cambridge University Press. https://doi.org/10.1017/9781108990424.